\section{A Use Case Example}


\subsection{Predicting Arrival Delay in Minutes - Regression}

\textbf{What is the task?}

As a chief air traffic controller, your job is to come up with a prediction engine that can be used to tell
passengers whether an incoming flight will be delayed by X number of minutes. To accomplish this task, we have an
airlines dataset containing ${\sim}$44k flights since 1987 with features such as: origin and destination codes,
distance traveled, carrier, etc.  The key variable we are trying to predict is "ArrDelay"" (arrival delay) in
minutes. We will do this leveraging H2O and the Spark SQL library.

\textbf{SQL queries from Spark}

One of the many cool features about the Spark is the ability to initiate a Spark Session within our
application that enables us to write SQL-like queries against an existing \texttt{DataFrame}. Given the ubiquitous
nature of SQL, this is very appealing to data scientists who may not be comfortable yet with Scala / Java / Python, but
want to perform complex manipulations of their data.

Within the context of this example, we are going to first read in the airlines dataset and then process a weather
file that contains the weather data at the arriving city. Joining the two tables will require a Spark Session such that
we can write an INNER JOIN against the two independent \texttt{DataFrame}s.

The full source for the application is here: \url{http://bit.ly/1mo3XO2}

Let's get started!

First, create Spark Session and H2OContext:

\begin{lstlisting}[style=Scala]
import org.apache.spark.SparkConf
import ai.h2o.sparkling._
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setAppName("Sparkling Water: Join of Airlines with Weather Data")
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
val h2oContext = H2OContext.getOrCreate()
\end{lstlisting}

\newpage
Read the weather data:

\begin{lstlisting}[style=Scala]
val weatherDataFile = "examples/smalldata/chicago/Chicago_Ohare_International_Airport.csv"
val weatherTable = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv(weatherDataFile)
  .withColumn("Date", to_date(regexp_replace('Date, "(\\d+)/(\\d+)/(\\d+)", "$3-$2-$1")))
  .withColumn("Year", year('Date))
  .withColumn("Month", month('Date))
  .withColumn("DayofMonth", dayofmonth('Date))
\end{lstlisting}

Also read the airlines data:

\begin{lstlisting}[style=Scala]
val airlinesDataFile = "examples/smalldata/airlines/allyears2k_headers.csv"
val airlinesTable = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .csv(airlinesDataFile)
\end{lstlisting}

We load the data tables using Spark and also use Spark to do some data scrubbing.

Select flights destined for Chicago (ORD):
\begin{lstlisting}[style=Scala]
val flightsToORD = airlinesTable.filter('Dest === "ORD")
println(s"\nFlights to ORD: ${flightsToORD.count}\n")
\end{lstlisting}

At this point, we are ready to join these two tables which are currently Spark DataFrames:
\begin{lstlisting}[style=Scala]
val joined = flightsToORD.join(weatherTable, Seq("Year", "Month", "DayofMonth"))
\end{lstlisting}

\newpage
Run deep learning to produce a model estimating arrival delay:

\begin{lstlisting}[style=Scala]
import ai.h2o.sparkling.ml.algos.H2ODeepLearning
val dl = new H2ODeepLearning()
    .setLabelCol("ArrDelay")
    .setColumnsToCategorical(Array("Year", "Month", "DayofMonth"))
    .setEpochs(5)
    .setActivation("RectifierWithDropout")
    .setHidden(Array(100, 100))
val model = dl.fit(joined)
\end{lstlisting}

More parameters for Deep Learning and all other algorithms can be found in H2O documentation at \url{http://docs.h2o.ai}.

Now we can run this model on our test dataset to score the model against our holdout dataset:
\begin{lstlisting}[style=Scala]
val predictions = model.transform(joined)
\end{lstlisting}

